Purpose

This script takes a deep dive into the Landsat 7 labels for a more rigorous analysis of inconsistent band data and outliers in the filtered label dataset. Here we determine whether any more label data points should be removed from the training dataset, and whether the metadata in the outlier dataset can help us pre-emptively toss out scenes when we go to apply the classification algorithm.

harmonize_version = "2024-04-25"
outlier_version = "2024-04-25"

LS7 <- read_rds(paste0("data/labels/harmonized_LS57_labels_", harmonize_version, ".RDS")) %>% 
  filter(mission == "LANDSAT_7")

Check for mis-matched band data between user data and re-pull

Just look at the data to see whether the user-pulled data and our re-pull are consistent. Here, the user data are in “BX” format and the re-pull is in “SR_BX” format. These steps assure data quality in case a volunteer didn’t follow the directions explicitly, or in case there are differences in the re-pull. The re-pull masks out saturated pixels, so any instance where an “SR_BX” value is NA indicates that the pixel was saturated in at least one band.

pmap(.l = list(user_band = LS57_user,
               ee_band = LS57_ee,
               data = list(LS7),
               mission = list("LANDSAT_7")),
     .f = make_band_comp_plot)
(Six band comparison plots, one per Landsat 7 band.)

There is some mis-match here, so let’s filter out any labels where at least one band value doesn’t match between the user pull and the re-pull:

LS7_inconsistent <- LS7 %>% 
  filter(is.na(SR_B7) | B1 != SR_B1 | B2 != SR_B2 | B3 != SR_B3 | 
           B4 != SR_B4 | B5 != SR_B5 | B7 != SR_B7)

LS7_inconsistent %>% 
  group_by(class) %>% 
  summarise(n_labels = n()) %>% 
  kable()
| class | n_labels |
|:--|--:|
| cloud | 189 |
| darkNearShoreSediment | 1 |
| lightNearShoreSediment | 5 |
| offShoreSediment | 3 |
| openWater | 3 |
| other | 2 |
| shorelineContamination | 6 |

Most of these are cloud labels, where the pixel is saturated and then masked in the re-pull (resulting in an NA). Let’s drop those from this subset and then look more closely.

LS7_inconsistent <- LS7_inconsistent %>% 
  filter(!is.na(SR_B7))

This leaves 0.9% of the Landsat 7 labels as inconsistent. Let’s do a quick sanity check to make sure that we’ve dropped values that are inconsistent between pulls, as well as any where a band value is greater than 1:

LS7_filtered <- LS7 %>% 
  filter(# keep data where the re-pull data and user data match
         (B1 == SR_B1 & B2 == SR_B2 & B3 == SR_B3 & 
           B4 == SR_B4 & B5 == SR_B5 & B7 == SR_B7),
         # and drop rows where any re-pulled band value is greater than 1,
         # which isn't a valid value
         if_all(LS57_ee,
                ~ . <= 1))

And plot:

(Six band comparison plots for the filtered labels, one per band.)

And now let’s look at the data by class:

(Six plots of band data by class, one per band.)

We aren’t actually modeling “other” (not sufficient observations to classify) or “shorelineContamination” (we’ll use this later to block areas where there is likely shoreline contamination in the AOI), so let’s drop those categories and look at the data again. We’ll also drop the B1-B7 columns here.

LS7_for_class_analysis <- LS7_filtered %>% 
  filter(!(class %in% c("other", "shorelineContamination"))) %>% 
  select(-c(B1:B7))

(Six plots of band data by class, one per band.)

Check for systemic volunteer inconsistencies

Let’s also go back and check to see if there is any pattern to the inconsistent labels.

| vol_init | n_tot_labs | n_dates |
|:--|--:|--:|
| BGS | 5 | 4 |
| LAE | 4 | 3 |
| LRCP | 5 | 2 |
| SKS | 2 | 2 |
| ANK | 1 | 1 |
| FYC | 1 | 1 |
| HAD | 1 | 1 |

There are just a few inconsistencies per contributor, spread across multiple dates. This could just be a processing difference (if there happened to be an update to a scene since users pulled these data, or if these labels were on an overlapping portion of two scenes). I’m not concerned about any systemic errors here that might require modified data handling for a specific scene or contributor.

Outlier handling

There are statistical outliers within this dataset, and they may impact the interpretation of any statistical testing we do. Let’s see if we can narrow down where those outliers occur and/or glean anything from the outlier data that may be applicable to the application of the algorithm. Outliers may be a systemic issue (the scene itself is an outlier), a user issue (a user may have been a bad actor), or they might just be real. This section asks those questions. The “true outliers” that we dismiss from the dataset will also be used to aid in interpretation/application of the algorithm across the Landsat stack, so it is important to note any patterns we might see in the outlier dataset.

## [1] "Classes represented in outliers:"
## [1] "darkNearShoreSediment"  "lightNearShoreSediment" "offShoreSediment"      
## [4] "openWater"

Okay, 89 outliers (>1.5*IQR) out of 1638 - and they are all from non-cloud groups.
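
For reference, the >1.5\*IQR rule used to flag these outliers can be sketched in base R as follows (`is_iqr_outlier` is a hypothetical helper for illustration, not a function from this pipeline):

```r
# Hypothetical helper: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
is_iqr_outlier <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

# Only the extreme value is flagged
is_iqr_outlier(c(1, 2, 3, 4, 5, 100))
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE
```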

Systemic contributor/scene errors

Are there any contributors that show up more than others in the outliers dataset?

LS7_vol <- LS7_for_class_analysis %>% 
  group_by(vol_init) %>% 
  summarise(n_tot = n()) %>% 
  arrange(-n_tot)
LS7_out_vol <- outliers %>% 
  group_by(vol_init) %>% 
  summarise(n_out = n()) %>% 
  arrange(-n_out)
full_join(LS7_vol, LS7_out_vol) %>% 
  mutate(percent_outlier = n_out/n_tot*100) %>% 
  arrange(-percent_outlier) %>% 
  kable()
| vol_init | n_tot | n_out | percent_outlier |
|:--|--:|--:|--:|
| SKS | 384 | 32 | 8.333333 |
| BGS | 498 | 32 | 6.425703 |
| LRCP | 357 | 22 | 6.162465 |
| LAE | 86 | 2 | 2.325581 |
| HAD | 75 | 1 | 1.333333 |
| FYC | 112 | NA | NA |
| AMP | 70 | NA | NA |
| ANK | 56 | NA | NA |

These are along the same lines as the LS5 data: at or below 10%, and generally, the more labels, the more outliers.

How many of these outliers are in specific scenes?

LS7_out_date <- outliers %>% 
  group_by(date, vol_init) %>% 
  summarize(n_out = n())
LS7_date <- LS7_for_class_analysis %>% 
  filter(class != "cloud") %>% 
  group_by(date, vol_init) %>% 
  summarise(n_tot = n())
LS7_out_date <- left_join(LS7_out_date, LS7_date) %>% 
  mutate(percent_outlier = n_out/n_tot*100) %>% 
  arrange(-percent_outlier)
LS7_out_date %>% 
  kable()
| date | vol_init | n_out | n_tot | percent_outlier |
|:--|:--|--:|--:|--:|
| 2007-07-15 | SKS | 26 | 84 | 30.952381 |
| 2015-07-05 | BGS | 11 | 40 | 27.500000 |
| 2015-10-09 | BGS | 7 | 39 | 17.948718 |
| 2020-08-03 | BGS | 8 | 71 | 11.267606 |
| 2013-08-16 | LRCP | 12 | 108 | 11.111111 |
| 2016-11-12 | LAE | 2 | 18 | 11.111111 |
| 2004-06-20 | LRCP | 6 | 67 | 8.955224 |
| 2006-09-14 | BGS | 5 | 77 | 6.493506 |
| 2005-09-27 | SKS | 5 | 82 | 6.097561 |
| 2003-05-01 | LRCP | 4 | 96 | 4.166667 |
| 1999-10-29 | HAD | 1 | 64 | 1.562500 |
| 2011-04-05 | BGS | 1 | 70 | 1.428571 |
| 2017-07-26 | SKS | 1 | 82 | 1.219512 |

There are two scenes here with very high outlier rates (>20% of labels are outliers) - perhaps there is something about the atmospheric correction in these particular scenes, or about the general scene quality? Let’s look at the scene-level metadata:

LS7_out_date %>% 
  filter(percent_outlier > 20) %>% 
  select(date, vol_init) %>% 
  left_join(., LS7) %>% 
  select(date, vol_init, DATA_SOURCE_AIR_TEMPERATURE:max_cloud_cover) %>% 
  distinct() %>% 
  kable()
| date | vol_init | DATA_SOURCE_AIR_TEMPERATURE | DATA_SOURCE_ELEVATION | DATA_SOURCE_OZONE | DATA_SOURCE_PRESSURE | DATA_SOURCE_REANALYSIS | DATA_SOURCE_WATER_VAPOR | SENSOR_MODE_SLC | CLOUD_COVER_list | IMAGE_QUALITY_list | mean_cloud_cover | max_cloud_cover |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|--:|--:|
| 2007-07-15 | SKS | NCEP | GLS2000 | TOMS | NCEP | GEOS-5 FP-IT | NCEP | OFF | 38; 43 | 9 | 40.5 | 43 |
| 2015-07-05 | BGS | NCEP | GLS2000 | TOMS | NCEP | GEOS-5 FP-IT | NCEP | OFF | 33; 56 | 9 | 44.5 | 56 |

Image quality is high across the board, and both scenes have >30% cloud cover, but there is nothing egregious or obvious here.

How many bands are represented in each labeled point that is an outlier? If there are outliers amongst the RGB bands (which is how users labeled the data), there is probably a systemic problem. If the outliers are in singular bands, especially those that are not in the visible spectrum, we can dismiss the individual observations and probably assert that the scene as a whole is okay to use in training. As a first pass, for points where 3 or more bands are deemed outliers, let’s look at which bands those are:

| date | class | vol_init | user_label_id | n_bands_out | bands_out |
|:--|:--|:--|--:|--:|:--|
| 2020-08-03 | openWater | BGS | 4479 | 4 | SR_B1; SR_B2; SR_B3; SR_B4 |
| 2003-05-01 | darkNearShoreSediment | LRCP | 4123 | 3 | SR_B4; SR_B5; SR_B7 |
| 2007-07-15 | lightNearShoreSediment | SKS | 1061 | 3 | SR_B4; SR_B5; SR_B7 |
| 2007-07-15 | lightNearShoreSediment | SKS | 1062 | 3 | SR_B4; SR_B5; SR_B7 |
| 2007-07-15 | lightNearShoreSediment | SKS | 1085 | 3 | SR_B4; SR_B5; SR_B7 |
| 2015-10-09 | openWater | BGS | 2545 | 3 | SR_B1; SR_B2; SR_B3 |
| 2015-10-09 | openWater | BGS | 2546 | 3 | SR_B1; SR_B2; SR_B3 |
| 2015-10-09 | openWater | BGS | 2548 | 3 | SR_B1; SR_B2; SR_B3 |
| 2015-10-09 | openWater | BGS | 2549 | 3 | SR_B1; SR_B2; SR_B3 |
| 2015-10-09 | openWater | BGS | 2550 | 3 | SR_B1; SR_B2; SR_B3 |
| 2015-10-09 | openWater | BGS | 2551 | 3 | SR_B1; SR_B2; SR_B3 |
| 2016-11-12 | offShoreSediment | LAE | 1037 | 3 | SR_B4; SR_B5; SR_B7 |
| 2020-08-03 | openWater | BGS | 4476 | 3 | SR_B1; SR_B2; SR_B3 |
| 2020-08-03 | openWater | BGS | 4477 | 3 | SR_B1; SR_B2; SR_B3 |
| 2020-08-03 | openWater | BGS | 4478 | 3 | SR_B1; SR_B2; SR_B3 |
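
The per-label `n_bands_out` and `bands_out` tallies could be computed along these lines (a base-R sketch with toy data; the real pipeline presumably works from per-band outlier flags):

```r
# Toy per-label outlier flags: one row per label, one column per band
flags <- rbind(c(TRUE,  TRUE,  TRUE,  FALSE),
               c(FALSE, FALSE, TRUE,  FALSE))
colnames(flags) <- c("SR_B1", "SR_B2", "SR_B3", "SR_B4")

# Count flagged bands per label and list which bands they are
n_bands_out <- rowSums(flags)
bands_out <- apply(flags, 1, function(r) paste(colnames(flags)[r], collapse = "; "))

n_bands_out
## [1] 3 1
bands_out
## [1] "SR_B1; SR_B2; SR_B3" "SR_B3"
```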

Let’s group by image date and volunteer and tally up the number of labels where at least 3 bands were outliers:

| date | vol_init | n_labels |
|:--|:--|--:|
| 2015-10-09 | BGS | 6 |
| 2020-08-03 | BGS | 4 |
| 2007-07-15 | SKS | 3 |
| 2003-05-01 | LRCP | 1 |
| 2016-11-12 | LAE | 1 |

Nothing to write home about here.

QA Pixels

Do any of the labels have QA pixel indications of contamination? Let’s see if the medium certainty classification in the QA band is useful here:

LS7_for_class_analysis %>% 
  mutate(QA = case_when(cirrus_conf >=2 ~ "cirrus",
                   snowice_conf >= 2 ~ "snow/ice",
                   cloudshad_conf >= 2 ~ "cloud shadow",
                   cloud_conf >= 2 ~ "cloud",
                   TRUE ~ "clear")) %>% 
  group_by(QA) %>% 
  filter(class != "cloud") %>% 
  summarize(n_tot = n()) %>% 
  kable()
| QA | n_tot |
|:--|--:|
| clear | 1168 |
| cloud shadow | 22 |
| snow/ice | 2 |

And how about high certainty:

LS7_for_class_analysis %>% 
  mutate(QA = case_when(cirrus_conf >= 3 ~ "cirrus",
                   snowice_conf >= 3 ~ "snow/ice",
                   cloudshad_conf >= 3 ~ "cloud shadow",
                   cloud_conf >= 3 ~ "cloud",
                   TRUE ~ "clear")) %>% 
  group_by(QA) %>% 
  filter(class != "cloud") %>% 
  summarize(n_tot = n()) %>% 
  kable()
| QA | n_tot |
|:--|--:|
| clear | 1168 |
| cloud shadow | 22 |
| snow/ice | 2 |

What about low confidence?

LS7_for_class_analysis %>% 
  mutate(QA = case_when(snowice_conf >= 1 ~ "snow/ice",
                   cloudshad_conf >= 1 ~ "cloud shadow",
                   cirrus_conf >= 1 ~ "cirrus",
                   cloud_conf >= 1 ~ "cloud",
                   TRUE ~ "clear")) %>% 
  group_by(QA) %>% 
  filter(class != "cloud") %>% 
  summarize(n_tot = n()) %>% 
  kable()
| QA | n_tot |
|:--|--:|
| snow/ice | 1192 |

Low confidence is NOT helpful! Let’s move forward with medium confidence and look at the flagged data from all classes except cloud:

LS7_qa_flagged <- LS7_for_class_analysis %>% 
  filter((cirrus_conf >= 2 |
           snowice_conf >= 2 |
           cloudshad_conf >= 2 |
           cloud_conf >= 2),
    class != "cloud") %>%
  group_by(date, vol_init) %>% 
  summarise(n_qa_flagged = n()) %>% 
  arrange(-n_qa_flagged)
LS7_tot <- LS7_for_class_analysis %>% 
  group_by(date, vol_init) %>% 
  filter(class != "cloud") %>% 
  summarise(n_tot_labels = n())
left_join(LS7_qa_flagged, LS7_tot) %>% 
  mutate(percent_qa_flagged = round(n_qa_flagged/n_tot_labels*100, 1)) %>% 
  arrange(-percent_qa_flagged) %>% 
  kable()
| date | vol_init | n_qa_flagged | n_tot_labels | percent_qa_flagged |
|:--|:--|--:|--:|--:|
| 2000-06-25 | LAE | 5 | 5 | 100.0 |
| 2020-08-03 | BGS | 8 | 71 | 11.3 |
| 2015-10-09 | BGS | 4 | 39 | 10.3 |
| 2008-09-03 | FYC | 2 | 24 | 8.3 |
| 2020-09-04 | FYC | 2 | 27 | 7.4 |
| 2015-07-05 | BGS | 1 | 40 | 2.5 |
| 2005-09-27 | SKS | 2 | 82 | 2.4 |

We don’t want to use data that has QA flags for any pixel that isn’t labeled cloud. Let’s look at the image where >20% of labels are QA-flagged:

Woof! Do these pixels also have high opacity?

LS7_for_class_analysis %>% 
  filter(date == "2000-06-25", class != "cloud",
         (cirrus_conf >= 2 |
           snowice_conf >= 2 |
           cloudshad_conf >= 2 |
           cloud_conf >= 2)) %>% 
  select(class, SR_ATMOS_OPACITY, cirrus_conf, cloud_conf, cloudshad_conf, snowice_conf) %>% 
  kable()
| class | SR_ATMOS_OPACITY | cirrus_conf | cloud_conf | cloudshad_conf | snowice_conf |
|:--|--:|--:|--:|--:|--:|
| darkNearShoreSediment | 0.383 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.382 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.383 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.383 | 0 | 1 | 3 | 1 |
| darkNearShoreSediment | 0.383 | 0 | 1 | 3 | 1 |

Yup! All above 0.3, so these will get tossed in QA.

What about the outliers?

outliers %>% 
  mutate(QA = case_when(snowice_conf >= 2 ~ "snow/ice",
                   cloudshad_conf >= 2 ~ "cloud shadow",
                   cirrus_conf >= 2 ~ "cirrus",
                   cloud_conf >= 2 ~ "cloud",
                   TRUE ~ "clear")) %>% 
  group_by(QA) %>% 
  filter(class != "cloud") %>% 
  summarize(n_out_tot = n()) %>% 
  kable()
| QA | n_out_tot |
|:--|--:|
| clear | 87 |
| cloud shadow | 2 |

And let’s look at atmospheric opacity:

outliers %>% 
  filter(class != "cloud") %>% 
  filter(SR_ATMOS_OPACITY > 0.3) %>% 
  pluck("SR_ATMOS_OPACITY") %>% 
  range(na.rm = T)
## [1] 0.329 0.845

High opacity in 42 of the 89 outliers.

Clouds

How many of these outliers have near-pixel clouds (as measured by ST_CDIST)?

There are 11 labels (12.4% of outliers) that aren’t “cloud” in the outlier dataset that have a cloud distance <500 m, and 44 labels (2.7%) in the whole dataset that have a cloud distance <500 m. While there are definitely closer clouds in the outlier dataset, it’s not a sufficient disparity to filter data by this index.
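
As a toy illustration of the comparison above (assuming cloud distances from `ST_CDIST` have already been rescaled to meters; the raw column’s scaling may differ):

```r
# Hypothetical sketch: percent of labels within 500 m of a detected cloud
cdist_m <- c(120, 800, 300, 2500, 60)   # toy cloud distances in meters
round(mean(cdist_m < 500) * 100, 1)
## [1] 60
```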

How many of the outliers have high cloud cover, as reported by the scene-level metadata? Note, we don’t have the direct scene cloud cover associated with individual labels, rather a list of the scene level cloud cover values associated with the AOI.

The outlier dataset contains 8 (9%) where the max cloud cover was > 75% and 15 (16.9%) where the mean cloud cover was > 50%. The filtered dataset contains 76 (4.6%) where max was >75% and 115 (7%) where the mean cloud cover was > 50%. While there is a greater instance of higher CLOUD_COVER in the outliers, it’s not a large enough portion of the outlier dataset to say that we should just toss scenes of either case above.
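
The mean and max cloud-cover values are presumably derived from the “; ”-separated scene-level list; a minimal sketch (helper name hypothetical):

```r
# Hypothetical helper: summarise a "; "-separated CLOUD_COVER_list entry
summarise_cloud_list <- function(cc_list) {
  vals <- as.numeric(strsplit(cc_list, ";\\s*")[[1]])
  c(mean_cloud_cover = mean(vals), max_cloud_cover = max(vals))
}

s <- summarise_cloud_list("38; 43")
s[["mean_cloud_cover"]]   # 40.5, matching the scene metadata table above
s[["max_cloud_cover"]]    # 43
```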

Training dataset implications

Based on the above review, any label with an SR_ATMOS_OPACITY value >= 0.3, or a QA flag for cloud, cloud shadow, or snow/ice, should be eliminated from this dataset.

LS7_training_labels <- LS7_for_class_analysis %>% 
  filter(class == "cloud" |
           (SR_ATMOS_OPACITY < 0.3 &
            cloud_conf < 2 &
            cloudshad_conf < 2 & 
            snowice_conf < 2))

Testing for inter-class differences

We do want to have an idea of how different the classes are, in regards to band data. While there are a bunch of interactions that we could get into here, for the sake of this analysis, we are going to analyze the class differences by band.

Kruskal-Wallis assumptions:

  1. Data are non-Normal or have a skewed distribution
  2. There must be at least two independent groups.
  3. Data have a similar distribution across groups.
  4. Data are independent, the groups shouldn’t have a relationship to one another
  5. Each group should contain at least 5 observations

ANOVA assumptions:

  1. Data are distributed normally
  2. Data are independent
  3. Variance across groups is similar

We can’t entirely assert sample independence, and we know that the variance and distribution differ for “cloud” labels, but those data are also visibly different from the other classes.

In order to systematically test for differences between classes and be able to interpret the data, we will need to know some things about our data:

  1. Are the data normally distributed (Shapiro-Wilk)?
  2. Are there outliers that may impact interpretation?
  3. If the data are non-normal, perform a Kruskal-Wallis test; otherwise an ANOVA
  4. If the null is rejected (i.e., there is a difference in at least one class), perform a post-hoc test for pairwise comparisons (the Dunn test in either case)
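
The steps above can be sketched in base R (toy, deliberately skewed data; the Dunn post-hoc test would come from a package such as rstatix or FSA, so it is only indicated in a comment):

```r
set.seed(42)
# Toy skewed band values for two classes
dat <- data.frame(
  value = c(rexp(30, rate = 1), rexp(30, rate = 0.3)),
  class = rep(c("openWater", "offShoreSediment"), each = 30)
)

# 1. Normality check (Shapiro-Wilk)
sw <- shapiro.test(dat$value)

# 3. Non-normal data: Kruskal-Wallis; otherwise ANOVA
if (sw$p.value < 0.05) {
  res <- kruskal.test(value ~ class, data = dat)
} else {
  res <- aov(value ~ class, data = dat)
}

# 4. If the null is rejected, follow up with a pairwise post-hoc test,
#    e.g. rstatix::dunn_test(dat, value ~ class)
```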

With this workflow, most classes are statistically different - below are the cases where the pairwise comparisons were not deemed statistically significant:

## # A tibble: 8 × 9
##   band  group1          group2    n1    n2 statistic       p  p.adj p.adj.signif
##   <chr> <chr>           <chr>  <int> <int>     <dbl>   <dbl>  <dbl> <chr>       
## 1 SR_B1 darkNearShoreS… offSh…    72   295    0.0870 0.931   1      ns          
## 2 SR_B2 darkNearShoreS… offSh…    72   295   -1.54   0.123   1      ns          
## 3 SR_B3 darkNearShoreS… light…    72   292    2.17   0.0297  0.297  ns          
## 4 SR_B4 darkNearShoreS… light…    72   292    0.368  0.713   1      ns          
## 5 SR_B5 darkNearShoreS… light…    72   292   -0.582  0.560   1      ns          
## 6 SR_B5 offShoreSedime… openW…   295   302   -2.80   0.00512 0.0512 ns          
## 7 SR_B7 darkNearShoreS… light…    72   292   -0.900  0.368   1      ns          
## 8 SR_B7 offShoreSedime… openW…   295   302   -2.43   0.0153  0.153  ns

There is some consistency here: “darkNearShoreSediment” is often not different from other sediment types by band. It is entirely possible that band interactions overpower these non-significant differences.

Let’s look at the boxplots for the training labels, dropping the cloud labels to see the ranges better:

LS7_training_labels_no_clouds <- LS7_training_labels %>% 
  filter(class != "cloud")
pmap(.l = list(data = list(LS7_training_labels_no_clouds),
               data_name = list("LANDSAT_7"),
               band = LS57_ee),
     .f = make_class_comp_plot)
(Six boxplots of band data by class, one per band.)

There is 1 outlier in SR_B1, 1 in SR_B2, 1 in SR_B3, 8 in SR_B4, 53 in SR_B5, and 51 in SR_B7.

DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment

There are definitely some varying patterns here, let’s zoom in on the sediment classes.

DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment

Export the training labels

Things to note for Landsat 7:

  • pixels with QA flags should be dismissed from model application

  • pixels with SR_ATMOS_OPACITY > 0.3 should be dismissed from model application

write_rds(LS7_training_labels, paste0("data/labels/LS7_labels_for_tvt_", outlier_version, ".RDS"))